An optimized GAN method based on the Que-Attn and contrastive learning for underwater image enhancement

Research on underwater image processing has increased significantly in the past decade due to the precious resources that exist underwater. However, restoring degraded underwater images remains a challenging problem. Existing prior-based methods show limited performance in many cases because they rely on hand-crafted features. Therefore, in this paper, we propose an effective unsupervised generative adversarial network (GAN) for underwater image restoration. Specifically, we embed the idea of contrastive learning into the model. The method encourages two elements (corresponding patches) to map to similar points in the learned feature space relative to other elements (other patches) in the data set, and maximizes the mutual information between input and output through the PatchNCE loss. We design a query attention (Que-Attn) module, which compares feature distances in the source domain and produces an attention matrix in which each row forms a probability distribution. We then select queries based on an importance measure calculated from these distributions. We also verify the generalization performance of the model on several benchmark datasets. Experiments and comparisons with state-of-the-art methods show that our model outperforms the others.


Introduction
Compared with depleted land-based energy sources, the use of marine energy is still at an early stage. The ocean contains not only abundant natural mineral energy such as petroleum but also sustainable renewable energy such as tidal and wave energy, which makes long-term human development possible. To make full use of marine energy, underwater image processing technology is critical. However, images captured underwater are often degraded by blur, color deviation and low contrast. For example, as light travels through water, red light, which has a longer wavelength than green and blue light, is absorbed more quickly, so underwater images often have a typical blue or green hue, as shown in Fig 1. In addition, large amounts of suspended particles scatter light in the water, resulting in blurred images. Effective underwater image enhancement is expected to improve the visual quality of input images by improving visibility, removing color casts and correcting low contrast. At the same time, enhanced visibility can make scenes and objects of interest stand out, providing better preliminary information for high-level computer vision tasks such as object detection and recognition.
Early underwater image enhancement methods redistribute the pixel values of a single image. However, the redistributed image is prone to oversaturation, and such methods depend heavily on prior knowledge. The emergence of deep learning solves these problems to a certain extent. However, deep learning-based underwater image algorithms usually require large datasets, especially paired datasets, which are difficult to collect. Studies have shown that underwater image enhancement using unsupervised learning can perform well on unpaired and unlabeled data sets. Among unsupervised approaches, contrastive methods compare data with positive and negative samples in the feature space to obtain a better feature representation. Generative adversarial networks, in turn, are powerful models for image enhancement. Combining an unsupervised contrastive loss with a generative adversarial network can bring the enhanced underwater image closer to the real image. Therefore, this paper proposes to combine unsupervised contrastive learning with a generative adversarial network for underwater image enhancement. In addition, an attention mechanism is introduced to further improve the enhancement effect. The main contributions of this paper are as follows:
• We propose an underwater image enhancement method based on unsupervised contrastive learning. This method combines unsupervised contrastive learning with a generative adversarial network, which maximizes the mutual information between the original image and the restored image, loosens the requirement for large numbers of paired data, and produces clearer restored images.
• We propose an attention-querying mechanism for underwater image enhancement tasks. We first select relevant anchor points and then use them as queries that attend to and absorb the features of other locations, finally forming better features suitable for contrastive learning. Our proposed attention-querying mechanism preserves the simple design of CUT and does not add any model parameters.
• We conduct experiments on multiple public underwater data sets to show that our method restores clearer images with higher-quality detail without consuming more computing resources. In summary, our method achieves competitive performance on multiple evaluation metrics.
Fig 1. Graphical demonstration of attenuation rates corresponding to different wavelengths when light propagates in water. Blue travels the longest because it has the shortest wavelength. This is one of the main reasons why underwater images often appear blue [1].
https://doi.org/10.1371/journal.pone.0279945.g001

Underwater image enhancement
Image processing-based methods mainly adjust the pixel values of underwater images to improve visual quality, and include pixel value adjustment, Retinex decomposition and image fusion. Zhang et al. [8] proposed an extended multi-scale Retinex underwater image enhancement method consisting of three steps, i.e. color correction, layer decomposition and enhancement of the input image. Ancuti et al. [9] proposed a multi-scale fusion strategy that blends color-compensated and white-balanced versions of a given image to produce better results. Recently, motivated by the severely uneven spectral distribution of underwater images, Ancuti et al. [10] introduced a color channel compensation preprocessing step in the opponent color channel to overcome artifacts. In summary, image processing-based approaches can improve the visual effect to a certain extent. However, they often fail to provide high-quality results in complex scenarios because they ignore domain knowledge about underwater imaging.
Most physics-based approaches rely on underwater imaging models in which the background light and the transmission map are estimated from some prior. Such priors include the underwater dark channel prior, the minimum information loss prior and the color line prior. Cosman et al. [13] proposed an underwater image restoration method that uses a blurriness prior to estimate the scene depth more accurately. Inspired by the principle of minimum information loss, Li et al. [12] proposed a minimum information loss criterion. Guo et al. [12] estimated the optimal transmission map to recover the underwater image and used the histogram distribution to effectively improve contrast and brightness. Recently, Berman et al. [14] fused color line prior information and multispectral profile information of different water types into physical models and optimized them using the gray-world assumption, achieving a good image denoising effect. These methods can restore underwater images well in some cases. However, when the prior does not hold, some regions inevitably suffer from unwanted artifacts and color deviations.
In recent years, with the development of deep learning, learning-based methods have made great progress in underwater image enhancement. Many methods improve performance by training their models on real underwater images. For example, in order to relax the need for paired training data, Li et al. [19] proposed a weakly supervised underwater color transfer model based on the cycle-consistent generative adversarial network (CycleGAN) and real data. As a pioneering work, Li et al. [20] constructed a real underwater image enhancement dataset with a total of 950 pairs of original and reference underwater images. The reference images were generated by 12 enhancement algorithms and scored by 50 volunteers to select the final result. With these images, Chen et al. [21] designed a gated fusion network that learns three confidence maps and uses them to fuse three preprocessed versions of the input into an enhanced result. Recently, Li et al. [22] developed an underwater image enhancement network with a medium transmission-guided multi-color space embedding to achieve more robust enhancement. Overall, the above methods still restore color and texture details unsatisfactorily and may produce unrealistic results. We address this by combining contrastive learning with an attention mechanism.
There are also many algorithms that train their networks on data generated by generative adversarial networks or physical models. For example, drawing on domain knowledge of underwater imaging, Li et al. [17] designed a generative adversarial network that generates underwater images from in-air images and depth maps, and then used the generated data to correct color casts in a supervised manner. Fabbri et al. [18] directly used CycleGAN to generate paired training data and then trained a fully convolutional encoder-decoder to improve underwater image quality. In addition, Li et al. [19] proposed to synthesize 10 types of underwater images based on the underwater imaging model and some scene parameters. Using those synthetic data, Li et al. [20] proposed a new approach. Anwar et al. [22] proposed an end-to-end model that first directly restores a clear latent underwater image and then performs post-processing to improve the subjective visual effect. Dudhane et al. [23] improved on this work by introducing object blurriness and color shift components to synthesize more accurate underwater image data.

Self-supervised learning
Despite the great success of supervised learning, deep neural networks have been criticized for requiring large amounts of labeled training data. Recent research on self-supervised learning has shown that it can learn strong representations from unlabeled images, especially with the help of a contrastive loss [24,25]. The idea is to pull together features extracted from the same image and push apart features from different images, which performs instance-level discrimination and learns a feature embedding. More recently, it has been studied as a pre-training technique that provides an initial model or latent embedding for downstream computer vision tasks [26-31]. Self-supervised learning has also been applied to image generation. SS-GAN [32] uses rotation prediction as an auxiliary task of the discriminator to prevent the over-fitting caused by the limited real/fake binary classification data. LT-GAN [33] trains an auxiliary classifier on the discriminator embedding to classify whether two pairs of fake images have the same perturbation in the sampled noise vector. In addition to CUT, Kang et al. also adopted contrastive learning for image-to-image (I2I) translation. Their method uses a non-local [34] attention matrix to warp the target image to the source pose and requires the features of the warped image to approach those of the source through a contrastive loss.

Problem setup
Optical detection and recognition of small underwater targets is key to the intelligent operation of underwater fishing robots. However, underwater target detection and recognition based on optical vision also faces significant challenges. The main reason is that the images obtained by the vision system are seriously degraded by the complex ocean imaging environment. The degradation mainly includes color deviation caused by the absorption of light in water, refraction of light caused by forward and backward scattering, image blurring, low contrast, and obstruction of light by particulate matter in the water. As a result, the images exhibit color fading, low contrast and blurred details.
Underwater imaging equipment drives the development of underwater imaging, and underwater exploration increasingly calls for high-quality images acquired with state-of-the-art equipment. General imaging systems suffer from color degradation, loss of detail and texture, and blur. More importantly, acquiring high-quality images is costly. Underwater image enhancement is therefore significant for obtaining high-quality underwater images for research.

Materials and methods
We define two domains $X \subset \mathbb{R}^{H \times W \times C}$ and $Y \subset \mathbb{R}^{H \times W \times C}$. Given an image $I_x \in \mathbb{R}^{H \times W \times C}$ from the source domain $X$, which represents real underwater images, and an image $I_y \in \mathbb{R}^{H \times W \times C}$ from the recovery domain $Y$, our goal is to find the mapping $G : X \to Y$ that achieves underwater image recovery. Our model consists of a generator $G$ and a discriminator $D$: $G$ realizes the mapping from domain $X$ to domain $Y$, while $D$ ensures that the translated image belongs to the right image domain. We denote the first part of the generator as an encoder and the second part as a decoder, written as $G_{enc}$ and $G_{dec}$. In the mapping process, we extract image features using several layers of the encoder and then pass the extracted features to a 2-layer MLP projection head (represented by the function $H$). The projection head learns to project the features extracted by the encoder into a stack of features. Fig 3 shows the schematic diagram of our model.

Attention mechanism
It is well known that, for neural networks, the more parameters a network has, the stronger its representational ability and the more information it stores, although this can cause information overload. An attention mechanism can therefore be introduced to focus on the features that matter most for the task among the many input features. By ignoring unimportant features, or filtering out features that do not contribute to the ground truth, we reduce the data dimension and further improve the speed and accuracy of the current task.
Querying attention. We introduce an attention-querying procedure into our underwater image enhancement model on the basis of unsupervised contrastive learning. Features $F_x$ and $F_y$ are extracted from $I_x$ and $G(I_x)$ by the encoder $G_{enc}$. We then reshape $F_x$ and compute the attention matrix $A_g$ from it. The rows of $A_g$ are sorted by significance, and $N$ rows are selected to form the query attention matrix $A_{QA}$. We then apply routing to the feature values of the source and target domains and obtain positive, negative and anchor features to construct the contrastive loss $L_{con}$. The positive and negative features come from the real image $I_x$, and the anchor features come from the translated image $G(I_x)$. Orange, blue, and green patches represent positive, negative, and anchor points, respectively. Some features do not reflect domain characteristics and are simply retained during the translation process, so applying $L_{con}$ to them matters little. Our goal is to select the anchor points $q$ and compute $L_{con}$ only at important anchors that contain more domain-specific information. To do so, we define a quantified value for each potential location that reflects the significance of its feature. The quadratic attention matrix is used because it compares each feature exhaustively with all other positions and therefore accurately reflects its similarity to the others. Fig 4 shows the schematic diagram of the attention block.
Query set filtering. The CUT network randomly selects the anchor, positive and negative samples used to calculate the contrastive loss. This is inefficient because it must compute the distance between the positive sample and all negative samples, and the corresponding patches may not come from domain-related regions. Note that some features do not reflect domain characteristics and are simply retained during the translation process, so the contrastive loss imposed on them matters little. Our goal is to select anchors and calculate the contrastive loss only at important anchors that contain more domain-specific information.
Global attention and local attention. Based on the above observations, our goal is to define a quantified value for each potential location that reflects the importance of its feature. The quadratic attention matrix is used because it exhaustively compares each feature with all the others and can accurately reflect its similarity to them. In particular, given a feature $F_x \in \mathbb{R}^{H \times W \times C}$ in the source domain, we first reshape it into a 2D matrix $Q \in \mathbb{R}^{HW \times C}$ and then multiply it by its transpose $K \in \mathbb{R}^{C \times HW}$. We then apply a softmax function to each row of the product to obtain the global attention matrix $A_g \in \mathbb{R}^{HW \times HW}$. The importance of a feature can therefore be measured by the entropy of the corresponding row of the global attention matrix, computed as in Eq 1:

$$M_g(i) = -\sum_{j=1}^{HW} A_g(i, j) \log A_g(i, j) \qquad (1)$$
Here $i$ and $j$ are the indexes of the query and key, corresponding to the row and column numbers in $A_g$. When $M_g(i)$ approaches 0, only a few key locations in row $i$ are similar to the query. We therefore regard such a query as significant and worth constraining with the contrastive loss. In order to select all meaningful queries, the rows of $A_g$ are sorted in ascending order of entropy $M_g$, and the $N$ rows with the smallest entropy are selected as the query attention matrix $A_{QA} \in \mathbb{R}^{N \times HW}$. Note that $A_{QA}$ is fully determined by the features of $I_x$ and does not depend on $G(I_x)$. Although global attention helps to capture the global context and smooths the detailed context around the query, its computational cost is high. Therefore, we combine global attention with local attention in order to reduce the computational cost. Local attention measures the similarity between the query and its neighboring keys in a fixed window of size $w \times w$ with a stride of 1, which captures the spatial interaction of local areas at lower cost. Given the reshaped query matrix $Q_l \in \mathbb{R}^{HW \times C}$, we multiply it by the local key matrix $K_l \in \mathbb{R}^{HW \times w^2 \times C}$ and apply the softmax function to obtain the local attention matrix $A_l \in \mathbb{R}^{HW \times w^2}$. The local entropy $M_l$ is calculated for each row, as shown in Eq 2:

$$M_l(i) = -\sum_{j=1}^{w^2} A_l(i, j) \log A_l(i, j) \qquad (2)$$
Here $i$ and $j$ are the indexes of the query and key. We select the $N$ rows of $A_l$ with the smallest entropy by sorting $M_l$ in ascending order to form the query attention matrix $A_{QA}$. For value routing, we also locate $N$ indices in the local value matrix $V_l \in \mathbb{R}^{HW \times w^2 \times C}$ and obtain the value matrix
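The entropy-based query selection described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`global_attention`, `local_attention`, `select_queries`), the zero padding at the image border, the odd window size and the small `eps` inside the logarithm are all assumptions made for the sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def row_entropy(attn, eps=1e-12):
    # Entropy of each row of an attention matrix (eps avoids log(0)).
    return -(attn * np.log(attn + eps)).sum(axis=1)


def global_attention(feat):
    # feat: (H, W, C) source feature map. Returns A_g of shape (HW, HW).
    h, w, c = feat.shape
    q = feat.reshape(h * w, c)           # Q in R^{HW x C}
    return softmax(q @ q.T, axis=1)      # K is the transpose of Q


def local_attention(feat, window=3):
    # Returns A_l of shape (HW, window^2); assumes an odd window and
    # zero padding at the border (both are assumptions of this sketch).
    h, w, c = feat.shape
    pad = window // 2
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)))
    shifts = [padded[i:i + h, j:j + w]
              for i in range(window) for j in range(window)]
    keys = np.stack(shifts, axis=2).reshape(h * w, window * window, c)
    q = feat.reshape(h * w, 1, c)
    return softmax((q * keys).sum(axis=-1), axis=1)


def select_queries(attn, num_queries):
    # Keep the rows with the smallest entropy as the query attention matrix.
    idx = np.argsort(row_entropy(attn))[:num_queries]
    return attn[idx], idx
```

For a feature map of shape (H, W, C), `select_queries(global_attention(feat), N)` returns an N × HW matrix playing the role of $A_{QA}$ together with the indices of the selected query positions; the same call on `local_attention(feat, w)` gives the local variant.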

Loss function
The adversarial loss encourages the generator to produce images that are visually similar to images in the target domain. For the mapping $G : X \to Y$ with discriminator $D$, the loss of the generative adversarial network is calculated as in Eq 3:

$$L_{GAN}(G, D, X, Y) = \mathbb{E}_{y \sim Y}[\log D(y)] + \mathbb{E}_{x \sim X}[\log(1 - D(G(x)))] \qquad (3)$$

where $G$ tries to generate images $G(x)$ that look like images from domain $Y$, while $D$ aims to distinguish the generated samples $G(x)$ from the real data $y$. Our goal is to maximize the mutual information between corresponding patches of the input and output. For example, a patch in the generated recovery image should be associated more strongly with the same patch in the original input underwater image than with other patches in that image. Therefore, we use a noise contrastive estimation framework to maximize the mutual information between the inputs and outputs. The basic idea of contrastive learning is to associate two related signals: the method learns feature representations by comparing a sample with positive and negative samples in the feature space. We map the query, one positive and $N$ negatives to $K$-dimensional vectors, denoted $v, v^+ \in \mathbb{R}^K$ and $v^- \in \mathbb{R}^{N \times K}$, respectively. Note that $v_n^- \in \mathbb{R}^K$ is the $n$-th negative. We set up an $(N + 1)$-way classification problem and calculate the probability that the positive is chosen over the negatives. Mathematically, this can be expressed as a cross-entropy loss, calculated as in Eq 4:

$$\ell(v, v^+, v^-) = -\log \left[ \frac{\exp(\mathrm{sim}(v, v^+)/\tau)}{\exp(\mathrm{sim}(v, v^+)/\tau) + \sum_{n=1}^{N} \exp(\mathrm{sim}(v, v_n^-)/\tau)} \right] \qquad (4)$$
Here $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is the cosine similarity between $u$ and $v$, and $\tau$ denotes the temperature parameter used to scale the distance between the query and the other instances. We use 0.07 as the default value of $\tau$ and 255 as the default number of negatives.
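As a concrete illustration of this $(N + 1)$-way cross-entropy with cosine similarity and temperature, the following NumPy sketch computes the loss for a single query. The function name `info_nce` and the helper `l2_normalize` are hypothetical names chosen for this example, and the sketch covers one query only rather than the full training objective.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products equal cosine similarity.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def info_nce(query, positive, negatives, tau=0.07):
    # query, positive: shape (K,); negatives: shape (N, K).
    q = l2_normalize(query)
    pos = l2_normalize(positive)
    neg = l2_normalize(negatives)
    # Logit 0 is the positive, logits 1..N are the negatives.
    logits = np.concatenate(([q @ pos], neg @ q)) / tau
    logits = logits - logits.max()       # stabilise the log-sum-exp
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

The loss is small when the query is much closer to its positive than to any negative. When the positive and all negatives are equally similar to the query, it equals $\log(N + 1)$, the loss of a uniform guess among the $N + 1$ candidates.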
We use a 2-layer projection head $H$ to extract features from domain $X$. We first select $L$ layers of the encoder $G_{enc}$ and send their outputs to $H$, embedding an image into the feature stack $\{z_l\}_L = \{H_l(G_{enc}^l(x))\}_L$, where $G_{enc}^l$ denotes the output of the $l$-th selected layer. Each feature in this stack represents a patch of the image. We denote the spatial positions in each selected layer as $s \in \{1, \ldots, S_l\}$, where $S_l$ is the number of spatial positions in layer $l$. We select one query at a time and denote the corresponding positive feature as $z_l^s \in \mathbb{R}^{C_l}$ and all other (negative) features as $z_l^{S \setminus s} \in \mathbb{R}^{(S_l - 1) \times C_l}$, where $C_l$ is the number of channels in layer $l$. Our goal is to match the corresponding patches of the input and output images. Therefore, we can use the following equation to represent the patch-based multilayer contrastive loss of the mapping $G : X \to Y$.
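For reference, in the CUT formulation that this model builds on, the multilayer patchwise contrastive loss takes the form below. Two things are assumptions here rather than statements from the text: $\hat{z}_l$ is taken to denote the features of the translated image $G(x)$ passed through the same encoder layers and projection head, and the inner sum runs over all positions $s$, whereas with query attention it would presumably run only over the $N$ selected queries. $\ell$ is the cross-entropy loss of Eq 4.

```latex
\hat{z}_l = H_l\big(G_{enc}^{l}(G(x))\big), \qquad
L_{PatchNCE}(G, H, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l}
\ell\big(\hat{z}_l^{s},\, z_l^{s},\, z_l^{S \setminus s}\big)
```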
The generated restored image should be realistic, which is enforced by $L_{GAN}$, and corresponding patches in the input original image and the generated restored image should match, which is enforced by $L_{PatchNCE}$. The resulting restored image should share the same structure as the original input image, and the overall loss is:

Experimental setup

Datasets
In our experimental setup, we conduct experiments on multiple datasets to test our model's performance and verify its robustness across datasets.
UIEB. The UIEB dataset [20] is a composite dataset that consists of 890 paired images, with another 60 challenging images to be restored without ground truth. Among the paired images, we randomly select 800 as the training set and use the remaining 90 as the test set. The size of the images is 860 × 590.
HICRD. HICRD [35] is a composite dataset that contains 9,676 original underwater images and 2,000 restored reference images. It contains two subsets, unpaired HICRD and paired HICRD. It also contains measured parameters such as diffuse attenuation coefficients and camera sensor responses.
EUVP. The EUVP dataset [36] is a composite dataset collected with various cameras, such as GoPros, low-light USB cameras and other cameras, during ocean exploration under varying visibility conditions. It includes 5,550 pairs of paired training data, 3,200 unpaired training images, and 515 pairs of test data. The resolution of the images is 256 × 256. We use the 3,200 unpaired images to train our network.

The evaluation metrics
We use two common image quality metrics, peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), to quantitatively analyze the performance of our model. PSNR measures the quality of the generated image from the mean squared error (MSE) between the generated image and the reference image.
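The relationship between PSNR and MSE can be made concrete with a short NumPy sketch. The function name `psnr` and the default peak value of 255 (8-bit images) are assumptions of this example; the text does not state the value range used in the experiments.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    # Peak signal-to-noise ratio in dB, computed from the mean squared error.
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

A lower MSE gives a higher PSNR, so larger values indicate that the generated image is closer to the reference.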
In order to verify the performance of underwater color recovery, we also use the no-reference underwater image quality measure (UIQM) [37] to analyze the quality of the output image. UIQM includes three underwater image property measures: underwater image colorfulness measure (UICM), underwater image sharpness measure (UISM), and underwater image contrast measure (UIConM). Each attribute evaluates one aspect of underwater image degradation.

Experimental parameter setting
Our network is trained for 200 epochs with an initial learning rate of 0.0002. The learning rate declines linearly after half of the epochs. We load all images at a resolution of 800 × 800 and randomly crop them to 512 × 512 patches during training. We use images of 1680 × 892 resolution for all methods. For our network, we use spectral normalization for the discriminator and instance normalization for the generator. We set the batch size to 1 and use the Adam optimizer with β1 = 0.5 and β2 = 0.999. We use a Tesla V100 32GB GPU to train our method and the other baselines.
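The training schedule and cropping step might be sketched as follows. This is one possible reading, not the authors' code: it assumes the learning rate stays constant for the first half of the 200 epochs and then decays linearly to zero, and that crops are sampled uniformly at random. The names `learning_rate` and `random_crop` are hypothetical.

```python
import numpy as np


def learning_rate(epoch, total_epochs=200, base_lr=2e-4):
    # Constant for the first half of training, then linear decay to zero
    # (assumed interpretation of the schedule; epoch is 0-indexed).
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    return base_lr * (total_epochs - epoch) / (total_epochs - half)


def random_crop(image, size=512, rng=None):
    # Sample a size x size patch uniformly at random from an (H, W, C) image.
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]
```

Under this reading, the rate is 0.0002 through epoch 99, 0.0001 at epoch 150, and reaches zero at the end of training.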

Results
In order to objectively evaluate the restoration quality of our method, SSIM and PSNR are selected to evaluate the enhancement performance on the partially restored, degraded and clear images of the EUVP dataset. At the same time, the underwater image quality measure UIQM is used to assess the colorfulness (UICM), sharpness (UISM) and contrast (UIConM) of the generated images. The results are shown in Table 1, where our method performs well on these metrics.
We also conduct experiments on the UIEB dataset, in which 3200 unpaired images are selected for training and 515 test images are used for testing. The results show good performance in SSIM, PSNR and UIQM, as shown in Table 2.
In order to verify the generalization ability of our model, we also conduct experiments on the high-resolution dataset HICRD, whose images have a resolution of 1842 × 980. We analyze the experimental results quantitatively and qualitatively, and show that our method not only leads in many evaluation metrics, as shown in Table 3, but also exhibits good visual performance, as shown in Fig 5.

Ablation experiments
In order to further demonstrate the advantage of our network structure, we conduct experiments on the individual modules of the proposed network. We mainly focus on (1) whether contrastive learning is used and (2) whether the query attention mechanism is used, and test the effect of these modules on the results. Table 4 shows the experimental results, from which we can see that performance improves progressively as the modules are added.

Conclusion
This paper proposes an underwater image enhancement model. The network is an end-to-end unsupervised generative adversarial network that uses contrastive learning and an attention mechanism to complement the original network. Our qualitative and quantitative analyses on multiple datasets show that the network achieves competitive results in the generated recovered images. The experiments on real datasets also show that our network is more robust.